What is claimed is:

1. A document descriptor extraction method comprising the steps of:
generalizing input sequences associated with a document to develop general 
sequences, said input sequences reflecting the structure of a document; 
factoring said input sequences and said general sequences to develop factored
sequences;
selecting a document descriptor from said input sequences, said general sequences, and
said factored sequences using minimum descriptor length (MDL) principles.

2. The method of claim 1, wherein said selecting step comprises the steps of:
encoding said input sequences, said general sequences, and said factored sequences;
and
selecting a document descriptor which encompasses all of said input sequences and
exhibits a minimum MDL cost.

3. The method of claim 2, wherein said encoding step employs an algorithm which 
applies a set of rules comprising: 

seq(D, s) = ε if D = s, if D does not contain metacharacters;
seq(D_1 ... D_k, s_1 ... s_k) = seq(D_1, s_1) ... seq(D_k, s_k);
seq(D_1 | ... | D_m, s) = i seq(D_i, s);
seq(D*, s_1 ... s_k) = {k seq(D, s_1) ... seq(D, s_k) if k > 0; 0 otherwise};

wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular
expression that the corresponding sequence s matches, wherein log m bits are needed to
encode index i.
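
The encoding rules recited above can be illustrated with a minimal Python sketch. It is not part of the claims. The tuple-based expression format, the names `encode` and `_matches`, the 0-based alternative index, and the choice to return the first matching encoding found (rather than the cheapest one) are all assumptions made for illustration; bit counting is omitted.

```python
def _matches(expr, s, pos):
    """Yield (end, tokens) for every way expr can match s starting at pos."""
    kind = expr[0]
    if kind == "sym":
        # Rule 1: a plain symbol sequence encodes to the empty string (no tokens).
        if s.startswith(expr[1], pos):
            yield pos + len(expr[1]), []
    elif kind == "cat":
        # Rule 2: the encoding of a concatenation is the concatenated encodings.
        def walk(i, p, acc):
            if i == len(expr[1]):
                yield p, acc
                return
            for q, toks in _matches(expr[1][i], s, p):
                yield from walk(i + 1, q, acc + toks)
        yield from walk(0, pos, [])
    elif kind == "or":
        # Rule 3: emit the index i of the matching alternative, then its encoding.
        for i, alt in enumerate(expr[1]):
            for q, toks in _matches(alt, s, pos):
                yield q, [("i", i)] + toks
    elif kind == "star":
        # Rule 4: emit the repetition count k (0 if no repetitions), then
        # the encodings of the k repetitions.
        def repeat(p, k, acc):
            yield p, [("k", k)] + acc
            for q, toks in _matches(expr[1], s, p):
                if q > p:  # skip empty repetitions so the search terminates
                    yield from repeat(q, k + 1, acc + toks)
        yield from repeat(pos, 0, [])
    else:
        raise ValueError(f"unknown expression kind: {kind!r}")


def encode(expr, s):
    """Return the first encoding of s with respect to expr, or None if no match."""
    for end, toks in _matches(expr, s, 0):
        if end == len(s):
            return toks
    return None


# Hypothetical example: the expression (a|b)* applied to the sequence "ab".
a_or_b_star = ("star", ("or", [("sym", "a"), ("sym", "b")]))
print(encode(a_or_b_star, "ab"))  # [('k', 2), ('i', 0), ('i', 1)]
```

In this sketch each `("i", i)` token stands for an index that would take log m bits to encode, where m is the number of alternatives.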

4. The method of claim 3, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), said FLP modified to 
compute said minimum MDL cost of potential document descriptors. 

5. The method of claim 4, wherein said document descriptor is a document type 
descriptor (DTD), and said document is an Extensible Markup Language (XML) document.

6. The method of claim 5, wherein said minimum MDL cost comprises summing a first 
length of bits describing the DTD and a second length of bits for encoding the sequences. 
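
As an illustration only, the two-part cost can be sketched in Python as follows. This is not the modified FLP algorithm referred to in the claims: it is a brute-force search over candidate subsets, usable only for tiny inputs. The function names, the dictionary layout, and every bit count in the example are hypothetical.

```python
from itertools import combinations
import math


def total_cost(chosen, describe_bits, encode_bits, sequences):
    """Two-part MDL cost: bits to describe the chosen candidates plus bits to
    encode each sequence with its cheapest chosen candidate.
    Returns infinity if some sequence is covered by no chosen candidate."""
    cost = sum(describe_bits[c] for c in chosen)
    for s in sequences:
        cost += min(
            (encode_bits[c][s] for c in chosen if s in encode_bits[c]),
            default=math.inf,
        )
    return cost


def select_min_mdl(describe_bits, encode_bits, sequences):
    """Exhaustively pick the candidate subset with the lowest total cost."""
    best_set, best_cost = None, math.inf
    names = sorted(describe_bits)
    for r in range(1, len(names) + 1):
        for chosen in combinations(names, r):
            cost = total_cost(chosen, describe_bits, encode_bits, sequences)
            if cost < best_cost:
                best_set, best_cost = set(chosen), cost
    return best_set, best_cost


# Hypothetical bit counts: three input sequences and one general candidate.
sequences = ["ab", "abab", "ababab"]
describe_bits = {"ab": 6, "abab": 12, "ababab": 18, "(ab)*": 15}
encode_bits = {
    "ab": {"ab": 0},
    "abab": {"abab": 0},
    "ababab": {"ababab": 0},
    "(ab)*": {"ab": 2, "abab": 2, "ababab": 2},
}
print(select_min_mdl(describe_bits, encode_bits, sequences))  # ({'(ab)*'}, 21)
```

With these made-up numbers, keeping the three input sequences verbatim costs 36 bits, while the single general candidate costs 15 + 3 × 2 = 21 bits, so the general candidate is selected.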

7. A document descriptor extraction method comprising the steps of:
generalizing input sequences to develop general sequences, said input sequences 

reflecting the structure of data within a document; 

selecting a document descriptor from said input sequences and said general sequences 

using minimum descriptor length (MDL) principles. 

8. The method of claim 7, wherein said selecting step comprises the steps of: 
encoding said input sequences and said general sequences; and 
selecting a document descriptor which encompasses all of said input sequences and
exhibits a minimum MDL cost. 



9. The method of claim 8, wherein said encoding step employs an algorithm which
applies a set of rules comprising: 

seq(D, s) = ε if D = s, if D does not contain metacharacters;
seq(D_1 ... D_k, s_1 ... s_k) = seq(D_1, s_1) ... seq(D_k, s_k), if D is a concatenation of D_1 ... D_k;
seq(D_1 | ... | D_m, s) = i seq(D_i, s);
seq(D*, s_1 ... s_k) = {k seq(D, s_1) ... seq(D, s_k) if k > 0; 0 otherwise};

wherein D is a sequence of symbols, s is a sequence, and i is an index of a regular
expression that the corresponding sequence s matches, wherein log m bits are needed to
encode index i.

10. The method of claim 9, wherein said minimum MDL cost is determined by 
employing an algorithm to solve a facility location problem (FLP), wherein said FLP is 
modified to compute said minimum MDL cost of potential document descriptors. 

11. The method of claim 10, wherein said document descriptor is a document type
descriptor (DTD), and said document is an Extensible Markup Language (XML) document.

12. The method of claim 11, wherein said minimum MDL cost comprises summing a 
first length of bits describing the DTD and a second length of bits for encoding the sequences.

13. The method of claim 7, further comprising the step of:
factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available for said step of selecting.

14. A computer-readable medium encoded with a computer program for generalizing
input sequences to develop general sequences, said computer program comprising: 
a discover OR patterns procedure; 
a discover sequence patterns procedure; and 

a generalize procedure which calls said discover sequence patterns procedure and calls 
said discover OR patterns procedure, wherein said discover OR patterns procedure is nested 
within said discover sequence patterns procedure. 

15. The computer-readable medium of claim 14, said computer program further 
comprising a partition procedure called by said discover OR patterns procedure. 

16. A method for generalizing input sequences to develop general sequences
comprising the steps of: 

discovering OR patterns among said input sequences; and 

discovering sequence patterns among said input sequences and OR patterns. 

17. The method of claim 16, wherein said step of discovering OR patterns comprises 
the step of partitioning said input sequences. 

18. A document descriptor extraction method comprising the steps of:
generalizing input sequences, said generalizing step comprising the steps of: 
discovering OR patterns among said input sequences, and 
discovering sequence patterns among said input sequences and OR patterns; 

and 

selecting a document descriptor from said input sequences and said general sequences. 

19. The method of claim 18, wherein said discovering OR patterns step comprises the 
step of partitioning said input sequences. 

20. The method of claim 19, further comprising the steps of: 

factoring said input sequences and said general sequences to develop factored 
sequences, wherein said factored sequences are available to said step of selecting.

21. The method of claim 20, wherein said step of selecting employs minimum 
descriptor length (MDL) principles. 

22. The method of claim 21, wherein said document descriptor is a document type 
descriptor (DTD) and said document is an Extensible Markup Language (XML) document.